Results 1 - 17 of 17
1.
BMC Bioinformatics ; 25(1): 111, 2024 Mar 14.
Article in English | MEDLINE | ID: mdl-38486135

ABSTRACT

BACKGROUND: DNA-binding proteins (DNA-BPs) are proteins that bind to and interact with DNA. DNA-BPs regulate and affect numerous biological processes, such as transcription, DNA replication, DNA repair, and the organization of chromosomal DNA. Very few proteins, however, are DNA-binding in nature. Therefore, it is necessary to develop an efficient predictor for identifying DNA-BPs. RESULT: In this work, we propose new benchmark datasets for the DNA-binding protein prediction problem. We discovered several quality concerns with the widely used benchmark datasets, PDB1075 (for training) and PDB186 (for independent testing), which necessitated the preparation of new benchmark datasets. Our proposed datasets, UNIPROT1424 and UNIPROT356, can be used for model training and independent testing, respectively. We retrained selected state-of-the-art DNA-BP predictors on the new dataset and report their performance. We also trained a novel predictor using the new benchmark dataset. We extracted features from various feature categories, then used a Random Forest classifier and Recursive Feature Elimination with Cross-validation (RFECV) to select the optimal set of 452 features. We then proposed a stacking ensemble architecture as our final prediction model. Named Stacking Ensemble Model for DNA-binding Protein Prediction, or StackDPP for short, our model achieved accuracies of 0.92, 0.92 and 0.93 in 10-fold cross-validation, jackknife and independent testing, respectively. CONCLUSION: StackDPP performed very well in cross-validation testing and outperformed all the state-of-the-art prediction models in independent testing. Its cross-validation performance generalized very well to the independent test set. The source code of the model is publicly available at https://github.com/HasibAhmed1624/StackDPP. We therefore expect that this generalized model can be adopted by researchers and practitioners to identify novel DNA-binding proteins.

Subjects
Algorithms; DNA-Binding Proteins; DNA-Binding Proteins/metabolism; Software; DNA/metabolism
2.
Bioinformatics ; 39(10)2023 Oct 03.
Article in English | MEDLINE | ID: mdl-37756699

ABSTRACT

MOTIVATION: Spatial domain identification is a very important problem in the field of spatial transcriptomics. The state-of-the-art solutions to this problem focus on unsupervised methods, as there is a lack of data for a supervised learning formulation. The results obtained from these methods highlight significant opportunities for improvement. RESULTS: In this article, we propose a potential avenue for enhancement through the development of a semi-supervised approach based on a convolutional neural network. Named "ScribbleDom", our method leverages a human expert's input as a form of semi-supervision, thereby combining the cognitive abilities of human experts with the computational power of machines. ScribbleDom incorporates a loss function that integrates two components: similarity in gene expression profiles, and adherence to the input of a human annotator given as scribbles on histology images, which provide prior knowledge about spot labels. The spatial continuity of the tissue domains is taken into account by extracting information on the spot microenvironment through convolution filters of varying sizes, in the form of "Inception" blocks. By leveraging this semi-supervised approach, ScribbleDom significantly improves the quality of spatial domains, yielding superior results both quantitatively and qualitatively. Our experiments on several benchmark datasets demonstrate the clear edge of ScribbleDom over state-of-the-art methods: improvements in adjusted Rand index of 1.82% to 169.38% for 9 of the 12 human dorsolateral prefrontal cortex samples, and a 15.54% improvement on the melanoma cancer dataset. Notably, when the expert input is absent, ScribbleDom can still operate in a fully unsupervised manner, like the state-of-the-art methods, and produces results that remain competitive. AVAILABILITY AND IMPLEMENTATION: Source code is available at GitHub (https://github.com/1alnoman/ScribbleDom) and Zenodo (https://zenodo.org/badge/latestdoi/681572669).
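
The ScribbleDom abstract reports its gains in terms of the adjusted Rand index (ARI). As a reading aid, here is a minimal sketch of the standard ARI formula in plain Python; the function name and the toy label lists are illustrative assumptions and are not taken from the ScribbleDom code.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same spots."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("labelings must have the same length")
    n = len(labels_true)
    if n < 2:
        return 1.0  # no pairs to disagree on

    # Contingency counts: how many spots share (true label, predicted label).
    cells = Counter(zip(labels_true, labels_pred))
    rows = Counter(labels_true)
    cols = Counter(labels_pred)

    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())

    expected = sum_rows * sum_cols / comb(n, 2)  # agreement expected by chance
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both labelings
    return (sum_cells - expected) / (max_index - expected)
```

The index is 1.0 when the two partitions agree up to a renaming of the labels, and it can be negative when they agree less often than chance would predict.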

3.
Bioinformatics ; 39(6)2023 06 01.
Article in English | MEDLINE | ID: mdl-37285319

ABSTRACT

MOTIVATION: Spatial transcriptomics (ST) can reveal the existence and extent of spatial variation of gene expression in complex tissues. Such analyses could help identify spatially localized processes underlying a tissue's function. Existing tools to detect spatially variable genes assume a constant noise variance across spatial locations. This assumption might miss important biological signals when the variance can change across locations. RESULTS: In this article, we propose NoVaTeST, a framework to identify genes with location-dependent noise variance in ST data. NoVaTeST models gene expression as a function of spatial location and allows the noise to vary spatially. NoVaTeST then statistically compares this model to one with constant noise and detects genes showing significant spatial noise variation. We refer to these genes as "noisy genes." In tumor samples, the noisy genes detected by NoVaTeST are largely independent of the spatially variable genes detected by existing tools that assume constant noise, and provide important biological insights into tumor microenvironments. AVAILABILITY AND IMPLEMENTATION: An implementation of the NoVaTeST framework in Python along with instructions for running the pipeline is available at https://github.com/abidabrar-bracu/NoVaTeST.


Subjects
Software; Transcriptome; Gene Expression Profiling
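
The NoVaTeST abstract describes comparing a model with spatially varying noise against one with constant noise. The sketch below illustrates only that general idea, using a deliberately simplified setup that is an assumption of this note: two regions, Gaussian noise, a separate mean per region, and a likelihood-ratio test with one degree of freedom. It is not the NoVaTeST model, and every name in it is hypothetical.

```python
from math import erfc, log, pi, sqrt


def _residuals(values):
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def _gaussian_loglik(residuals, variance):
    n = len(residuals)
    return -0.5 * n * log(2 * pi * variance) - sum(r * r for r in residuals) / (2 * variance)


def noise_variance_test(region_a, region_b):
    """Likelihood-ratio test: one shared noise variance vs one per region.

    Returns (statistic, p_value). Each region must contain non-constant values.
    """
    res_a, res_b = _residuals(region_a), _residuals(region_b)

    # Null model: a single (pooled) maximum-likelihood variance.
    pooled = sum(r * r for r in res_a + res_b) / (len(res_a) + len(res_b))
    ll_constant = _gaussian_loglik(res_a + res_b, pooled)

    # Alternative model: each region has its own variance.
    var_a = sum(r * r for r in res_a) / len(res_a)
    var_b = sum(r * r for r in res_b) / len(res_b)
    ll_varying = _gaussian_loglik(res_a, var_a) + _gaussian_loglik(res_b, var_b)

    statistic = max(2 * (ll_varying - ll_constant), 0.0)
    # Survival function of a chi-square distribution with 1 degree of freedom.
    p_value = erfc(sqrt(statistic / 2))
    return statistic, p_value
```

A gene whose expression is tightly clustered in one region and widely scattered in the other yields a large statistic and a small p-value, which is the kind of signal a constant-noise model cannot express.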
4.
Bioinform Adv ; 3(1): vbad042, 2023.
Article in English | MEDLINE | ID: mdl-37092035

ABSTRACT

Motivation: Protein structure provides insight into how proteins interact with one another as well as their functions in living organisms. Prediction of protein backbone torsion angles (ϕ and ψ) is a key sub-problem in predicting protein structures. However, reliable determination of backbone torsion angles using conventional experimental methods is slow and expensive. Therefore, considerable effort is being put into developing computational methods for predicting backbone angles. Results: We present SAINT-Angle, a highly accurate method for predicting protein backbone torsion angles using a self-attention-based deep learning network called SAINT, which was previously developed for protein secondary structure prediction. We extended and improved the existing SAINT architecture and used transfer learning to predict backbone angles. We compared the performance of SAINT-Angle with state-of-the-art methods through an extensive evaluation study on a collection of benchmark datasets, namely TEST2016, TEST2018, TEST2020-HQ, CAMEO and CASP. The experimental results suggest that our proposed self-attention-based network, together with transfer learning, achieves notable improvements over the best alternate methods. Availability and implementation: SAINT-Angle is freely available as an open-source project at https://github.com/bayzidlab/SAINT-Angle. Supplementary information: Supplementary data are available at Bioinformatics Advances online.
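
Because ϕ and ψ are periodic, a prediction of 179° for a true angle of -179° is off by 2°, not 358°. The sketch below shows a wrap-around mean absolute error in plain Python. The abstract does not state which metric SAINT-Angle reports, so treating MAE as the metric is an assumption of this note, and the function names are hypothetical.

```python
def angular_error(pred_deg, true_deg):
    """Smallest absolute difference between two angles, in degrees (0 to 180)."""
    diff = abs(pred_deg - true_deg) % 360.0
    return min(diff, 360.0 - diff)


def mean_absolute_angle_error(predicted, actual):
    """Mean wrap-around error over paired lists of angles in degrees."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need two non-empty lists of equal length")
    return sum(angular_error(p, t) for p, t in zip(predicted, actual)) / len(predicted)
```

For example, `mean_absolute_angle_error([179.0, -60.0], [-179.0, -45.0])` averages errors of 2° and 15°.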

5.
J Comput Biol ; 30(3): 245-249, 2023 03.
Article in English | MEDLINE | ID: mdl-36706434

ABSTRACT

Motivation: Phylogenetic trees are often inferred from a multiple sequence alignment (MSA), and tree accuracy is heavily affected by the nature of the estimated alignment. Carefully equipping an MSA tool with multiple application-aware objectives improves its capability to yield better trees. Results: We introduce Multiobjective Application-aware Multiple Sequence Alignment and Maximum Likelihood Ensemble (MAMMLE), a framework for inferring better phylogenetic trees from unaligned sequences by hybridizing two MSA tools [i.e., Multiple Sequence Comparison by Log-Expectation (MUSCLE) and Multiple Alignment using Fast Fourier Transform (MAFFT)] with a multiobjective optimization strategy and leveraging multiple maximum likelihood hypotheses. In our experiments, MAMMLE shows a median improvement of 5.57% over MUSCLE on 50.34% of instances and a median deterioration of 4.77% on 37.41% of instances.

Subjects
Algorithms; Software; Phylogeny; Sequence Alignment
6.
Brief Bioinform ; 24(1)2023 01 19.
Article in English | MEDLINE | ID: mdl-36460620

ABSTRACT

Lysine succinylation is a post-translational modification (PTM) that plays a crucial role in regulating cellular processes. Aberrant succinylation may cause inflammation, cancers, metabolic diseases and nervous system diseases. The experimental methods to detect succinylation sites are time-consuming and costly. This calls for computational models with high efficacy, and the literature has paid attention to developing such models, albeit with only moderate success across different evaluation metrics. One crucial aspect in this context is the biochemical and physicochemical properties of amino acids, which appear to be useful as features for such computational predictors. However, some of the existing computational models did not use these properties, while others used them without considering the inter-dependency among the properties. The combinations of biochemical and physicochemical properties derived through our optimization process achieve better results than those achieved by combining all the properties. We propose three deep learning architectures: CNN+Bi-LSTM (CBL), Bi-LSTM+CNN (BLC) and their combination (CBL_BLC). We find that CBL_BLC outperforms the other two. Ensembling of different models improves the results, and tuning the threshold of the ensemble classifiers improves them further. Comparing our work with existing works on two datasets, we achieve better sensitivity and specificity by varying the threshold value.

Subjects
Algorithms; Lysine; Lysine/metabolism; Amino Acids/chemistry; Sensitivity and Specificity; Protein Processing, Post-Translational
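
The succinylation abstract says that ensembling models and tuning the decision threshold trades sensitivity against specificity. The sketch below illustrates that mechanism only, under the assumption that the ensemble averages per-site probabilities; the abstract does not say how the ensemble is combined, and the function names and scores are made up for illustration.

```python
def ensemble_scores(model_scores):
    """Average the per-sample probabilities of several models."""
    return [sum(column) / len(column) for column in zip(*model_scores)]


def sensitivity_specificity(scores, labels, threshold):
    """Sensitivity and specificity when scores >= threshold count as positive."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical outputs of two models on four candidate sites.
scores = ensemble_scores([[0.9, 0.6, 0.5, 0.2], [0.7, 0.5, 0.4, 0.0]])
labels = [1, 1, 0, 0]
for threshold in (0.4, 0.5, 0.6):
    print(threshold, sensitivity_specificity(scores, labels, threshold))
```

Lowering the threshold favors sensitivity and raising it favors specificity, so the threshold becomes a tunable parameter rather than a fixed 0.5.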
7.
Article in English | MEDLINE | ID: mdl-34928803

ABSTRACT

Multiple sequence alignment has been the traditional and well-established approach to sequence analysis and comparison, though it is time- and memory-consuming. As the scale of sequencing data keeps increasing, faster yet accurate alignment-free methods are becoming more important. Several alignment-free sequence analysis methods have been established in the literature in recent years; they extract numerical features from genomic data to analyze sequences and to estimate phylogenetic relationships among genes and species. The Minimal Absent Word (MAW) is an effective concept for representing characteristics of a sequence in an alignment-free manner. In this study, we present CD-MAWS, a distance measure based on the cosine of the angle between composition vectors constructed using minimal absent words, for sequence analysis in a computationally inexpensive manner. We have benchmarked CD-MAWS using several AFProject datasets, such as the Fish mtDNA, E. coli, Plants, Shigella and Yersinia datasets, and found it to perform quite well. Applied to several other biological datasets such as mammal mtDNA, bacterial genomes and viral genomes, CD-MAWS resolved phylogenetic relationships similar to or better than state-of-the-art alignment-free methods such as Mash, Skmer, Co-phylog and kSNP3.

Subjects
Algorithms; Genomics; Animals; Phylogeny; Genomics/methods; Sequence Analysis/methods; Escherichia coli; Genome, Bacterial; Sequence Analysis, DNA/methods; Mammals
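
A minimal absent word of a sequence is a word that does not occur in it although both its longest proper prefix and its longest proper suffix do. The sketch below enumerates such words by brute force (lengths 2 up to `max_len`) and compares two sequences by cosine distance. The abstract does not specify how CD-MAWS builds its composition vectors, so the binary presence/absence vector used here is an assumption, as are all function names; practical tools use far more efficient MAW algorithms.

```python
from math import sqrt


def minimal_absent_words(seq, max_len=4, alphabet="ACGT"):
    """Brute-force minimal absent words of seq with length 2..max_len."""
    # All substrings (factors) of seq up to max_len characters.
    factors = {
        seq[i:j]
        for i in range(len(seq))
        for j in range(i + 1, min(i + max_len, len(seq)) + 1)
    }
    maws = set()
    for prefix in factors:
        if len(prefix) >= max_len:
            continue
        for letter in alphabet:
            word = prefix + letter
            # word is absent, its prefix occurs by construction,
            # and its longest proper suffix must occur as well.
            if word not in factors and word[1:] in factors:
                maws.add(word)
    return maws


def cosine_distance(maws_a, maws_b):
    """1 - cosine similarity of binary indicator vectors over the MAW sets."""
    if not maws_a or not maws_b:
        return 0.0 if maws_a == maws_b else 1.0
    shared = len(maws_a & maws_b)
    return 1.0 - shared / sqrt(len(maws_a) * len(maws_b))
```

For instance, `minimal_absent_words("ACAC", max_len=3, alphabet="AC")` gives `{"AA", "CC"}`: neither word occurs in the sequence, but each of their single letters does.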
8.
PLoS One ; 17(12): e0278095, 2022.
Article in English | MEDLINE | ID: mdl-36454903

ABSTRACT

Customer churn is one of the most critical issues faced by the telecommunication industry (TCI). Researchers and analysts leverage customer relationship management (CRM) data, using various machine learning models and data transformation methods, to identify the customers who are likely to churn. While several studies have been conducted on customer churn prediction (CCP) in the TCI, a review of the performance of the models stemming from these studies shows clear room for improvement. Therefore, to improve the accuracy of customer churn prediction in the telecommunication industry, we investigated several machine learning models as well as data transformation methods. To optimize the prediction models, feature selection was performed using a univariate technique and the best hyperparameters were selected using the grid search method. Subsequently, experiments were conducted on several publicly available TCI datasets to assess the performance of our models in terms of widely used evaluation metrics, such as AUC, precision, recall, and F-measure. Through a rigorous experimental study, we demonstrated the benefit of applying data transformation methods as well as feature selection while training an optimized CCP model. Our proposed technique improved the prediction performance by up to 26.2% and 17% in terms of AUC and F-measure, respectively.

Subjects
Telecommunications; Benchmarking; Computer Systems; Head; Industry
9.
Stud Health Technol Inform ; 290: 709-713, 2022 Jun 06.
Article in English | MEDLINE | ID: mdl-35673109

ABSTRACT

The COVID-19 pandemic is taking a toll on the social, economic, and psychological well-being of people. During the pandemic, people have used social media platforms (e.g., Twitter) to communicate with each other and share their concerns and updates. In this study, we analyzed nearly 25M COVID-19-related tweets generated from 20 different countries and 28 states of the USA over a month. We applied sentiment analysis and topic modeling to this collection and clustered different geolocations based on their sentiment. Our analysis identified 3 geo-clusters (country- and US state-based) based on public sentiment and discovered 15 topics that could be summarized under three main themes: government actions, medical issues, and people's mood during home quarantine. The proposed computational pipeline adequately captured the Twitter population's emotion and sentiment, which could be linked to the decisions and actions (or lack thereof) of governments and policymakers. We believe that our analysis pipeline could be instrumental for policymakers in sensing public emotion and support with respect to the interventions and actions taken, for example, by the government.

Subjects
COVID-19; Social Media; Humans; Pandemics; Policy; SARS-CoV-2
10.
Comput Biol Chem ; 98: 107661, 2022 Jun.
Article in English | MEDLINE | ID: mdl-35339762

ABSTRACT

Multiple sequence alignment (MSA) is a prerequisite for several analyses in bioinformatics, such as phylogeny estimation and protein structure prediction. PASTA (Practical Alignments using SATé and TrAnsitivity) is a state-of-the-art method for computing MSAs, well known for its accuracy and scalability. It iteratively co-estimates both the MSA and the maximum likelihood (ML) phylogenetic tree, attempting to exploit the close association between the accuracy of an MSA and that of the corresponding tree over multiple iterations in both directions. Currently, PASTA uses the ML score as its optimization criterion, which is a good score in phylogeny estimation but cannot be proven to be a necessary and sufficient criterion for producing an accurate phylogenetic tree. Therefore, integrating multiple application-aware objectives into PASTA, carefully chosen for their better association with tree accuracy, may have a profound positive impact on its performance. This paper employs four application-aware objectives alongside the ML score to develop a multi-objective (MO) framework, PMAO, that leverages PASTA to generate a set of high-quality solutions that are considered equivalent in the context of the conflicting objectives under consideration. Our experimental analysis on a popular biological benchmark reveals that the tree-space generated by PMAO contains significantly better trees than stand-alone PASTA. To further help domain experts choose the most appropriate tree from the PMAO output (which contains a relatively large set of high-quality solutions), we have added a component to the PMAO framework that generates a smaller set of high-quality solutions. Finally, we have attempted to obtain a single high-quality solution without using any external evidence and have found that summarizing the few solutions detected through this component can serve the purpose to some extent.

Subjects
Computational Biology; Software; Algorithms; Phylogeny; Sequence Alignment
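
The PMAO abstract speaks of solutions that are "considered equivalent" under conflicting objectives; in multi-objective optimization this usually means the non-dominated (Pareto-optimal) set. The sketch below filters such a set from candidate score tuples, assuming every objective is to be maximized. It is a generic illustration with hypothetical names, not the PMAO search procedure or its objectives.

```python
def dominates(a, b):
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(solutions):
    """Keep the solutions that no other solution dominates (maximization)."""
    return [
        candidate
        for candidate in solutions
        if not any(dominates(other, candidate) for other in solutions)
    ]


# Hypothetical (objective 1, objective 2) scores of five candidate alignments.
candidates = [(1, 5), (2, 4), (3, 3), (2, 2), (1, 1)]
print(pareto_front(candidates))
```

None of the surviving candidates can be improved on one objective without losing on the other, which is why a further step is needed to pick a single tree.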
11.
Bioinformatics ; 37(21): 3734-3743, 2021 11 05.
Article in English | MEDLINE | ID: mdl-34086858

ABSTRACT

MOTIVATION: Species tree estimation from genes sampled from throughout the whole genome is complicated due to the gene tree-species tree discordance. Incomplete lineage sorting (ILS) is one of the most frequent causes for this discordance, where alleles can coexist in populations for periods that may span several speciation events. Quartet-based summary methods for estimating species trees from a collection of gene trees are becoming popular due to their high accuracy and statistical guarantee under ILS. Generating quartets with appropriate weights, where weights correspond to the relative importance of quartets, and subsequently amalgamating the weighted quartets to infer a single coherent species tree can allow for a statistically consistent way of estimating species trees. However, handling weighted quartets is challenging. RESULTS: We propose wQFM, a highly accurate method for species tree estimation from multi-locus data, by extending the quartet FM (QFM) algorithm to a weighted setting. wQFM was assessed on a collection of simulated and real biological datasets, including the avian phylogenomic dataset, which is one of the largest phylogenomic datasets to date. We compared wQFM with wQMC, which is the best alternate method for weighted quartet amalgamation, and with ASTRAL, which is one of the most accurate and widely used coalescent-based species tree estimation methods. Our results suggest that wQFM matches or improves upon the accuracy of wQMC and ASTRAL. AVAILABILITY AND IMPLEMENTATION: Datasets studied in this article and wQFM (in open-source form) are available at https://github.com/Mahim1997/wQFM-2020. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Genetic Speciation; Genomics; Computer Simulation; Genomics/methods; Phylogeny; Alleles
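
The wQFM abstract mentions generating quartets with weights from a collection of gene trees. The sketch below illustrates one simple weighting, assumed here for illustration only: the weight of a quartet topology is the number of gene trees that induce it. Trees are written as nested tuples of taxon names, a representation chosen for this note. The amalgamation step, which is the actual contribution of wQFM, is not shown.

```python
from collections import Counter
from itertools import combinations


def clades(tree):
    """Return (leaf set of tree, leaf sets of all internal nodes)."""
    if isinstance(tree, str):
        return frozenset([tree]), []
    leaves, found = frozenset(), []
    for child in tree:
        child_leaves, child_found = clades(child)
        leaves |= child_leaves
        found.extend(child_found)
    found.append(leaves)
    return leaves, found


def induced_quartet(four_taxa, clade_list):
    """Topology ab|cd induced on four taxa, or None if unresolved."""
    four = frozenset(four_taxa)
    for clade in clade_list:
        inside = clade & four
        if len(inside) == 2:  # this clade separates two taxa from the other two
            pair = tuple(sorted(inside))
            rest = tuple(sorted(four - inside))
            return tuple(sorted([pair, rest]))
    return None


def weighted_quartets(gene_trees):
    """Count, for every quartet topology, the gene trees that induce it."""
    weights = Counter()
    for tree in gene_trees:
        leaves, clade_list = clades(tree)
        for four in combinations(sorted(leaves), 4):
            quartet = induced_quartet(four, clade_list)
            if quartet is not None:
                weights[quartet] += 1
    return weights
```

With two gene trees `(("a", "b"), ("c", "d"))` and one `(("a", "c"), ("b", "d"))`, the topology ab|cd receives weight 2 and ac|bd receives weight 1.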
12.
BMC Med Imaging ; 21(1): 15, 2021 01 28.
Article in English | MEDLINE | ID: mdl-33509110

ABSTRACT

BACKGROUND: Segmentation of nuclei in cervical cytology pap smear images is a crucial stage in automated cervical cancer screening. The task itself is challenging due to the presence of cervical cells with spurious edges, overlapping cells, neutrophils, and artifacts. METHODS: After the initial preprocessing steps of adaptive thresholding, in our approach, the image passes through a convolution filter to filter out some noise. Then, contours from the resultant image are filtered by their distinctive contour properties followed by a nucleus size recovery procedure based on contour average intensity value. RESULTS: We evaluate our method on a public (benchmark) dataset collected from ISBI and also a private real dataset. The results show that our algorithm outperforms other state-of-the-art methods in nucleus segmentation on the ISBI dataset with a precision of 0.978 and recall of 0.933. A promising precision of 0.770 and a formidable recall of 0.886 on the private real dataset indicate that our algorithm can effectively detect and segment nuclei on real cervical cytology images. Tuning various parameters, the precision could be increased to as high as 0.949 with an acceptable decrease of recall to 0.759. Our method also managed an Aggregated Jaccard Index of 0.681 outperforming other state-of-the-art methods on the real dataset. CONCLUSION: We have proposed a contour property-based approach for segmentation of nuclei. Our algorithm has several tunable parameters and is flexible enough to adapt to real practical scenarios and requirements.


Subjects
Cervix Uteri/pathology; Early Detection of Cancer/methods; Image Processing, Computer-Assisted/methods; Papanicolaou Test/methods; Uterine Cervical Neoplasms/diagnosis; Uterine Cervical Neoplasms/pathology; Algorithms; Cell Nucleus; Female; Humans
13.
Bioinformatics ; 37(10): 1468-1470, 2021 06 16.
Article in English | MEDLINE | ID: mdl-33016997

ABSTRACT

MOTIVATION: Researchers and practitioners use a number of popular sequence comparison tools that rely on alignment-based techniques. Due to high time and space complexity and length-related restrictions, researchers often seek alignment-free tools. Recently, some interesting ideas, namely Minimal Absent Words (MAW) and Relative Absent Words (RAW), have received much interest in the scientific community as distance measures that can provide alignment-free alternatives. This motivates us to build a framework for analysing biological sequences in an alignment-free manner. RESULTS: In this application note, we present the Alignment-free Dissimilarity Analysis & Comparison Tool (ADACT), a simple web-based tool that computes the dissimilarity among sequences using a variety of indexes and reports it as a distance matrix, a species relation list and a phylogenetic tree. The tool combines absent word (MAW or RAW) computation, dissimilarity measures and species relationships, and thus brings all the required software into one platform for the convenience of researchers and practitioners in bioinformatics. We have also developed a RESTful API. AVAILABILITY AND IMPLEMENTATION: ADACT is hosted at http://research.buet.ac.bd/ADACT/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subjects
Algorithms; Nucleotides; Phylogeny; Sequence Alignment; Sequence Analysis, DNA; Software
14.
Bioinformatics ; 36(17): 4599-4608, 2020 11 01.
Article in English | MEDLINE | ID: mdl-32437517

ABSTRACT

MOTIVATION: Protein structures provide basic insight into how proteins interact with other proteins, as well as into their functions and biological roles in an organism. Experimental methods (e.g. X-ray crystallography and nuclear magnetic resonance spectroscopy) for determining the secondary structure (SS) of proteins are very expensive and time consuming. Therefore, developing efficient computational approaches for predicting the SS of proteins is of utmost importance. Advances in developing highly accurate SS prediction methods have mostly focused on 3-class (Q3) structure prediction. However, the 8-class (Q8) resolution of SS contains more useful information and is much more challenging to predict than Q3. RESULTS: We present SAINT, a highly accurate method for Q8 structure prediction, which combines the self-attention mechanism (a concept from natural language processing) with the Deep Inception-Inside-Inception network in order to effectively capture both the short- and long-range interactions among the amino acid residues. SAINT offers a more interpretable framework than typical black-box deep neural network methods. Through an extensive evaluation study, we report the performance of SAINT in comparison with the existing best methods on a collection of benchmark datasets, namely TEST2016, TEST2018, CASP12 and CASP13. Our results suggest that the self-attention mechanism improves the prediction accuracy and that SAINT outperforms the existing best alternate methods. SAINT is the first of its kind and offers the best known Q8 accuracy. Thus, we believe SAINT represents a major step toward the accurate and reliable prediction of the SS of proteins. AVAILABILITY AND IMPLEMENTATION: SAINT is freely available as an open-source project at https://github.com/SAINTProtein/SAINT.

Subjects
Deep Learning; Databases, Protein; Neural Networks, Computer; Protein Structure, Secondary; Proteins
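
The SAINT abstract credits self-attention with capturing both short- and long-range interactions among residues. The sketch below shows generic scaled dot-product attention in plain Python, in which every residue position weighs every other position regardless of distance. It omits the learned projections and does not reproduce the SAINT architecture; the function names and toy vectors are assumptions of this note.

```python
from math import exp, sqrt


def softmax(row):
    """Numerically stable softmax of a list of scores."""
    peak = max(row)
    exps = [exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention; returns (outputs, attention weights)."""
    dim = len(keys[0])
    weights = [
        softmax([sum(q * k for q, k in zip(query, key)) / sqrt(dim) for key in keys])
        for query in queries
    ]
    outputs = [
        [sum(w * value[j] for w, value in zip(row, values)) for j in range(len(values[0]))]
        for row in weights
    ]
    return outputs, weights
```

Each row of the returned weights sums to 1 and shows how much one position attends to every other position, which is one reason attention-based models are often described as more interpretable.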
15.
Artif Intell Med ; 94: 28-41, 2019 03.
Article in English | MEDLINE | ID: mdl-30871681

ABSTRACT

An antigen is a protein capable of triggering an effective immune system response. Protective antigens are those that can invoke a specific and enhanced adaptive immune response to subsequent exposure to the specific pathogen or related organisms. Such proteins are therefore of immense importance in vaccine preparation and drug design. However, the laboratory experiments to isolate and identify antigens from a microbial pathogen are expensive, time consuming and often unsuccessful. This is why Reverse Vaccinology has become the modern trend in vaccine search, where computational methods are first applied to predict protective antigens or their determinants, known as epitopes. In this paper, we propose a novel, accurate computational model to identify protective antigens efficiently. Our model extracts features directly from the protein sequences, without any dependence on functional domain or structural information. After the relevant features are extracted, we use the Random Forest algorithm to rank them. Recursive Feature Elimination (RFE) and the minimum redundancy maximum relevance (mRMR) criterion are then applied to extract an optimal set of features. The learning model was trained using the Random Forest algorithm. Named Antigenic, our proposed model demonstrates superior performance compared to the state-of-the-art predictors on a benchmark dataset. Antigenic achieves accuracy, sensitivity and specificity values of 78.04%, 78.99% and 77.08%, respectively, in 10-fold cross-validation testing. In jackknife cross-validation, the corresponding scores are 80.03%, 80.90% and 79.16%, respectively. The source code of Antigenic, along with the relevant dataset and detailed experimental results, can be found at https://github.com/srautonu/AntigenPredictor. A publicly accessible web interface has also been established at http://antigenic.research.buet.ac.bd.

Subjects
Antigens/analysis; Models, Biological; Algorithms; Amino Acids/analysis; Antigens/chemistry; Computational Biology/methods
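
Several abstracts in this list, including the Antigenic one, report jackknife (leave-one-out) results. The sketch below shows the procedure itself: each sample is held out once, a model is trained on the rest, and the held-out sample is predicted. To stay dependency-free it uses a nearest-centroid classifier as a stand-in; that substitution, the function names and the toy data are assumptions of this note, and the paper itself trains a Random Forest.

```python
def nearest_centroid_predict(train_x, train_y, x):
    """Stand-in classifier: return the label of the closest class centroid."""
    centroids = {}
    for label in set(train_y):
        rows = [row for row, y in zip(train_x, train_y) if y == label]
        centroids[label] = [sum(column) / len(rows) for column in zip(*rows)]
    return min(
        centroids,
        key=lambda label: sum((a - b) ** 2 for a, b in zip(x, centroids[label])),
    )


def jackknife_accuracy(xs, ys, predict=nearest_centroid_predict):
    """Leave-one-out accuracy: every sample is the test set exactly once."""
    correct = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        if predict(train_x, train_y, xs[i]) == ys[i]:
            correct += 1
    return correct / len(xs)
```

Any function with the same `(train_x, train_y, x)` signature can be passed as `predict`, so the loop is independent of the classifier being evaluated.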
16.
J Theor Biol ; 452: 22-34, 2018 09 07.
Article in English | MEDLINE | ID: mdl-29753757

ABSTRACT

A DNA-binding protein (DNA-BP) is a protein that can bind to and interact with DNA. Identification of DNA-BPs using experimental methods is expensive as well as time consuming. As such, fast and accurate computational methods are sought for predicting whether a protein can bind to DNA or not. In this paper, we focus on building a new computational model to identify DNA-BPs in an efficient and accurate way. Our model extracts meaningful information directly from the protein sequences, without any dependence on functional domain or structural information. After feature extraction, we employ a Random Forest (RF) model to rank the features. We then use the Recursive Feature Elimination (RFE) method to extract an optimal set of features and train a prediction model using a Support Vector Machine (SVM) with a linear kernel. Our proposed method, named DNA-binding Protein Prediction model using Chou's general PseAAC (DPP-PseAAC), demonstrates superior performance compared to the state-of-the-art predictors on a standard benchmark dataset. DPP-PseAAC achieves accuracy values of 93.21%, 95.91% and 77.42% for the 10-fold cross-validation test, jackknife test and independent test, respectively. The source code of DPP-PseAAC, along with the relevant dataset and detailed experimental results, can be found at https://github.com/srautonu/DNABinding. A publicly accessible web interface has also been established at http://77.68.43.135:8080/DPP-PseAAC/.

Subjects
Algorithms; Computational Biology/methods; DNA-Binding Proteins/metabolism; Support Vector Machine; Amino Acid Sequence; Amino Acids/chemistry; Amino Acids/genetics; Amino Acids/metabolism; DNA/chemistry; DNA/genetics; DNA/metabolism; DNA-Binding Proteins/chemistry; DNA-Binding Proteins/genetics; Databases, Protein; Models, Molecular; Nucleic Acid Conformation; Protein Domains; Reproducibility of Results
17.
Artif Intell Med ; 84: 90-100, 2018 01.
Article in English | MEDLINE | ID: mdl-29183738

ABSTRACT

The Golgi Apparatus (GA) is a key organelle for protein synthesis within the eukaryotic cell. The main task of the GA is to modify and sort proteins for transport throughout the cell. Proteins permeate through the GA on the side facing the ER (Endoplasmic Reticulum), the cis side, and depart on the other side, the trans side. Based on this phenomenon, there are two types of GA proteins, namely cis-Golgi proteins and trans-Golgi proteins. Any dysfunction of GA proteins can result in congenital glycosylation disorders and other difficulties that may lead to neurodegenerative and inherited diseases such as diabetes, cancer and cystic fibrosis. Accurate classification of GA proteins may therefore contribute to drug development, which will in turn help in medication. In this paper, we focus on building a new computational model that not only introduces easy ways to extract features from protein sequences but also optimizes the classification of trans-Golgi and cis-Golgi proteins. After feature extraction, we employ a Random Forest (RF) model to rank the features by the importance score obtained from it. After selecting the top-ranked features, we apply a Support Vector Machine (SVM) to classify the sub-Golgi proteins. We trained a regression model as well as a classification model and found the former to be superior. The model shows improved performance over all previous methods. As the benchmark dataset is significantly imbalanced, we applied the Synthetic Minority Over-sampling Technique (SMOTE) to balance it and conducted experiments on both versions. Our method, identification of sub-Golgi Protein Types (isGPT), achieves accuracy values of 95.4%, 95.9% and 95.3% for the 10-fold cross-validation test, jackknife test and independent test, respectively. According to different performance metrics, isGPT performs better than state-of-the-art techniques. The source code of isGPT, along with the relevant dataset and detailed experimental results, can be found at https://github.com/srautonu/isGPT.

Subjects
Computational Biology/methods; Golgi Apparatus/chemistry; Oligopeptides/analysis; Proteins/analysis; Support Vector Machine; Amino Acid Sequence; Animals; Databases, Protein; Humans; Oligopeptides/classification; Proteins/classification; Reproducibility of Results
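
The isGPT abstract balances its dataset with SMOTE, which creates synthetic minority samples by interpolating between a minority sample and one of its nearest minority neighbours. The sketch below shows that core idea in plain Python. It is a simplified illustration with hypothetical names and parameters, not the implementation used in isGPT, and it expects at least two minority samples.

```python
import random


def smote(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points from a list of minority feature vectors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the chosen sample.
        neighbours = sorted(
            (point for point in minority if point is not base),
            key=lambda point: sum((a - b) ** 2 for a, b in zip(base, point)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append([a + gap * (b - a) for a, b in zip(base, other)])
    return synthetic
```

Because every synthetic point lies on a segment between two real minority samples, the new points stay inside the region the minority class already occupies.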